1 Introduction


Current research at RSRE into language modelling for Automatic Speech Recognition
(ASR) involves the study of formal methods of grammatical inference; in particular, the
inference of stochastic context-free grammars. The Inside-Outside algorithm [2] re-estimates
the rewrite rule probabilities of a stochastic context-free grammar from many examples of
data strings (which may be words, sentences, sequences of grammatical tags, mathematical
expressions, or anything that could be explained by a context-free grammar). The
Inside-Outside algorithm is also known as Baker’s algorithm as it is based on his ‘nodal span’
principle, which generalises and extends the techniques used in Hidden Markov modelling
[4] to stochastic context-free grammars. The crucial idea is that the hidden random
variables are associated with spans (i.e. intervals or substrings of the data strings) rather than
with single sample times as in the (finite-state) Hidden Markov models.

This paper describes the mathematics of the Inside-Outside algorithm, a procedure for
programming the algorithm and some early results of grammatical inference work using
simple context-free grammars and small structured samples of English words. Section 2
briefly introduces stochastic formal grammars (SFGs) and gives an example of a stochastic
context-free grammar. (For a more detailed description of SFGs see [3].) Section 3 describes
the mathematics of the Inside-Outside (I-O) algorithm. Section 4 briefly discusses the
computer programming aspects. Section 5 describes some exploratory experiments using
the I-O algorithm on simple structured word strings. Section 6 briefly shows how the
inferred stochastic context-free grammar can be used to find the maximum likelihood parse
of a string. The final section is a discussion about the problems of using the I-O algorithm
for general grammatical inference.

2  Stochastic  Formal  Grammars 

2.1  Introduction 

A stochastic formal grammar (SFG) can be used to specify languages and to describe
physical patterns and data structures. An SFG can also be used in a generative capacity.
The rewrite rules of the SFG have associated probabilities which can be used in a random
sampling process to generate, for example, letters to form a word string. The probabilities
for all the rewrite rules with the same left hand side sum to unity. For example:
The symbols C and D are non-terminal symbols which are rewritten according to the
stochastic rules as AB or BA. (These rules will be called non-terminal rules.) The special
(start) symbol S denotes the start of the rewrite process. Pre-terminal symbols are those
non-terminal symbols which can only be rewritten as terminal symbols. The pre-terminal
symbols are A and B. (The rules which rewrite pre-terminals will be called terminal
rules.) The grammar will generate sequences of terminal symbols such as ‘deal’, ‘lead’, etc.
Figure 1: A derivation tree for the word ‘deal’

(Capital  italicised  letters  will  be  used  to  denote  non-terminal  symbols  and  lower  case  letters 
to  denote  terminal  symbols.) 

2.2  Tree  Diagrams 

A context-free grammar can be thought of as defining a set of tree diagrams. Each tree is
labelled with a start symbol, S, at the root node, with terminal symbols at the leaves and
non-terminal symbols labelling the inner nodes. Each node and its associated branches will
be called a sub-tree and each sub-tree corresponds to a production rule in the grammar.
Each tree has as its leaves a sequence of terminal symbols. Each tree has a probability
associated with it which is the product of the probabilities associated with the sub-trees.
The sum of all these tree probabilities is unity. Any particular sequence of terminal symbols
may have more than one tree representation and the probability of the terminal string
(given the grammar) is the sum of the probabilities of all these trees. The usual role of
a tree diagram is to illustrate the parse (or derivation) of a word, sentence or data string
according to the grammar.

For example, a derivation tree for the word ‘deal’ according to the grammar in (1) is
as shown in Figure 1. The probability of the word ‘deal’ being generated by the grammar
rules (which is shown in Figure 1) works out at about one chance in thirty-three (or 0.03).
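The product rule for tree probabilities can be checked with a short sketch. The terminal rule probabilities below are the ones quoted in the worked example of Section 3.5. The non-terminal rule probabilities (0.6 for S → C D, 1.0 for C → B A and for D → A B) are assumptions, chosen only because they are consistent with the quoted value of about 0.03; the nested-tuple tree encoding and the function name are likewise illustrative.

```python
# Terminal rule probabilities taken from the worked example in Section 3.5.
terminal_rules = {("B", "d"): 0.3, ("A", "e"): 0.6, ("A", "a"): 0.4, ("B", "l"): 0.7}

# ASSUMED non-terminal rule probabilities (not given in the text as reproduced here).
nonterminal_rules = {("S", "C", "D"): 0.6, ("C", "B", "A"): 1.0, ("D", "A", "B"): 1.0}


def tree_probability(tree):
    """Multiply the probabilities of every sub-tree (rule) used in the tree.

    A leaf sub-tree is (pre_terminal, terminal); an inner sub-tree is
    (label, left_subtree, right_subtree).
    """
    if len(tree) == 2:
        return terminal_rules[(tree[0], tree[1])]
    label, left, right = tree
    rule = nonterminal_rules[(label, left[0], right[0])]
    return rule * tree_probability(left) * tree_probability(right)


# The derivation S -> C D, C -> B A, D -> A B with leaves d, e, a, l.
deal = ("S", ("C", ("B", "d"), ("A", "e")), ("D", ("A", "a"), ("B", "l")))
print(tree_probability(deal))
```

Under these assumed values the product is roughly 0.0302, i.e. close to one chance in thirty-three.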
2.3 Stochastic Context-free Grammars


If the rewrite rules of the grammar are of the form:

$A \rightarrow \gamma^{*}$

(* is the Kleene operator which denotes a sequence or repetition) where

$A \in V_N$

and

$\gamma \in V_N \cup V_T$

where $V_N$ and $V_T$ are the sets of non-terminal and terminal symbols, respectively, (that is,
$\gamma$ is either a non-terminal symbol or a terminal symbol and A is a non-terminal symbol),
then the grammar is a context-free grammar (CFG).

In a context-free grammar any sequence of terminal and non-terminal symbols can
appear on the right hand side of the rewrite rule but the left hand side must consist of
one non-terminal symbol only. Any CFG (even when stochastic) can be transformed into a
more useful form (for the purposes of the I-O algorithm) called the Chomsky Normal Form
(CNF) [1]:

$A \rightarrow B\,C$
$C \rightarrow c$

Non-terminal rules in CNF have exactly two non-terminal symbols on their right hand side,
which limits the derivation tree to binary branches.


3  The  Inside-Outside  Algorithm 

3.1  Introduction 

The Inside-Outside (I-O) algorithm re-estimates the probabilities associated with each of
the rules in a stochastic context-free grammar (SCFG) given many examples of terminal strings
which have been (or could have been) generated by an SCFG. The I-O algorithm can only
accept and work with rules which are written in CNF. Initially, either random probabilities
or probabilities which represent some prior knowledge about the grammar can be assigned
to initialise the algorithm. Strings from a sample training set are read into the algorithm
and ‘nodal span’ probabilities are calculated. At the end of the training sample, the rewrite
rule probabilities are updated and the process continues to the next iteration. In effect,
the I-O algorithm considers all possible parses of the input string according to its current
rewrite rule probabilities and counts the number of times each of the rules is used. After
one pass through a training set, it normalises and weights these counts to give an estimate
of the probability of each rule.

3.2  Terminology 

The term ‘interval’ is used to refer to a substring of the input string and it is usually
associated with a particular node in the tree diagram. (In Baker’s terminology an interval
is referred to as a ‘span’ but the term interval will be used in this paper to distinguish it
from the use of ‘span’ as a verb.) Sub-tree probabilities are calculated for every interval
(i.e. from intervals of length one to the entire string) and for every node in the tree which
could ‘span’ the interval. In following sections these sub-tree probabilities will be called the
e and f values. For example, in Figure 2 the interval (delimited by square brackets) [de]
has length two and the node that spans [de] is labelled C.
Figure 2: The interval [de] spanned by node labelled C

3.3  Notation 

The rewrite rule probabilities fall into two distinct types. The probabilities which are
associated with non-terminal rules will be called the A-matrix probabilities and the terminal
rule probabilities will be called B-matrix probabilities.

The rewrite rule $label_i \rightarrow label_j\ label_k$, 1.0 is expressed in A-matrix notation as follows:

$a_{ijk} = 1.0$

A rewrite rule such as $label_j \rightarrow terminal_k$, 0.5 is expressed as:

$b_{jk} = 0.5$

In terms of the derivation trees, the B-matrix probabilities are associated with the leaf
(terminal) nodes and the A-matrix probabilities describe the statistics of the branches across
the set of trees. The e and f values are denoted by $e[s,t,i]$ and $f[s,t,i]$ where s marks the
first character of the interval, t marks the final character in the interval and i represents
the labelled node in the tree which spans the interval.

3.4 The e and f values

Assume that there are L strings, $d_1, d_2, \ldots, d_L$, in a training set. The $l$th string is
$O_1^l \ldots O_{T_l}^l$ and its length is $T_l$. For example, grammar (1) in Section 2.1 could generate
as string $d_1$ the word ‘deal’, so that $T_1$ is 4 and $O_1^1, \ldots, O_4^1$ are ‘d’, ..., ‘l’ respectively. The
e-value, $e[s,t,i]$, is the probability of the interval $O_s \ldots O_t$ given that the non-terminal, i,
spans the interval. The f-value, $f[s,t,i]$, is the probability of the intervals $O_1 \ldots O_{s-1}$ and
$O_{t+1} \ldots O_T$ given that the non-terminal, i, spans the interval $O_s \ldots O_t$. So the e-value gives
the probability of the sub-trees ‘inside’ the span of the non-terminal and the f-value gives
the probability of the sub-trees ‘outside’ the span of the non-terminal.
3.5 Calculation of the e and f values


The I-O algorithm is a kind of two-stage parsing process. The bottom up process generates
the e values and effectively considers all possible configurations of the parse tree ‘inside’
the section below the node spanning the interval (see Figure 3). The top down process
generates the f values and effectively considers all possible labelled sub-trees ‘outside’ the
section spanning the interval (see Figure 4). For example, given that the algorithm is
initialised with the probabilities of the SCFG as given in Section 2.1 and the input string
is ‘deal’, then the basic steps in the I-O algorithm are as follows:

1. Assign the current B-matrix probabilities to the e values for the first stage in the
bottom up process (i.e. for spans of length 1).

The only non-zero probabilities are:

$e[1,1,B] = b_{B O_1} = b_{Bd} = 0.3$

$e[2,2,A] = b_{A O_2} = b_{Ae} = 0.6$

$e[3,3,A] = b_{A O_3} = b_{Aa} = 0.4$

$e[4,4,B] = b_{B O_4} = b_{Bl} = 0.7$

2. Now consider all possible parses of the intervals of length two using the e values, as
assigned in step 1, and the existing A-matrix value for all possible nodes. The only non-zero
probabilities are:

$e[1,2,C] = e[1,1,B]\; e[2,2,A]\; a_{CBA}$

and

$e[3,4,D] = e[3,3,A]\; e[4,4,B]\; a_{DAB}$

(Using this particular grammar there are no alternative trees for the word ‘deal’ and so
there is only one term in the e-value probability calculation.)

3. Now consider the intervals of length three (and so on) so that all possible parses of
the interval given the node are added into the e-value. The general equation for the e-value
is:

$$e[s,t,i] = \sum_{r,j,k} e[s,r,j]\; e[r+1,t,k]\; a_{ijk}$$

where r varies between s and t - 1, so that both sub-intervals are non-empty.

4. The final step in the bottom up process calculates the e-value for the entire string.
This e-value, $e[1,T_1,S]$, is the probability of the word ‘deal’ given the current SCFG (i.e.
the current A-matrix and B-matrix).

5. The first step in the top down process is to set $f[1,T_1,S] = 1.0$.

6. The f values are then calculated using the f values which were calculated in previous
steps of the top down process and the e values as appropriate to the portions of the string
outside the intervals. The general equation for the f values is:

$$f[s,t,i] = \sum_{r,j,k} f[r,t,j]\; e[r,s-1,k]\; a_{jki} + \sum_{r,j,k} f[s,r,j]\; e[t+1,r,k]\; a_{jik}$$

where in the first term r varies between 1 and s - 1 (i.e. for the substring on the left hand
side of [s, t]) and in the second term r varies between t + 1 and $T_l$ (i.e. for the substring
on the right hand side of [s, t]).

Figure 5: Calculation of the weights

The I-O algorithm for SCFGs is analogous to the Forward-Backward algorithm for
Hidden Markov models [4] and uses the same kind of iterative Dynamic Programming
technique. The $\alpha$ and $\beta$ in the Forward-Backward algorithm are the equivalent of the e
and f values, respectively.
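The bottom up and top down passes described in steps 1 to 6 can be sketched as follows. This is a minimal illustration rather than the VAX PASCAL program described in Section 4: the dictionary layout, the function names and the rule probabilities for S (0.6 and 0.4) are assumptions, while the terminal rule probabilities are those quoted in step 1. Intervals are 1-based and inclusive, as in the text.

```python
def inside(word, A, B):
    """Bottom up pass: returns e[(s, t, i)] for intervals [s, t] spanned by label i."""
    T = len(word)
    e = {}
    for s in range(1, T + 1):                      # step 1: intervals of length one
        for (i, terminal), b in B.items():
            if terminal == word[s - 1]:
                e[(s, s, i)] = b
    for length in range(2, T + 1):                 # steps 2 to 4: longer intervals
        for s in range(1, T - length + 2):
            t = s + length - 1
            for (i, j, k), a in A.items():
                total = sum(e.get((s, r, j), 0.0) * e.get((r + 1, t, k), 0.0)
                            for r in range(s, t))
                if total:
                    e[(s, t, i)] = e.get((s, t, i), 0.0) + total * a
    return e


def outside(word, A, e, start="S"):
    """Top down pass: returns f[(s, t, i)], starting from f[1, T, S] = 1."""
    T = len(word)
    f = {(1, T, start): 1.0}                       # step 5
    for length in range(T - 1, 0, -1):             # step 6: parents are always longer
        for s in range(1, T - length + 2):
            t = s + length - 1
            for (j, left, right), a in A.items():
                # [s, t] is the right child of j, which spans [r, t]
                as_right = sum(f.get((r, t, j), 0.0) * e.get((r, s - 1, left), 0.0)
                               for r in range(1, s))
                # [s, t] is the left child of j, which spans [s, r]
                as_left = sum(f.get((s, r, j), 0.0) * e.get((t + 1, r, right), 0.0)
                              for r in range(t + 1, T + 1))
                if as_right:
                    f[(s, t, right)] = f.get((s, t, right), 0.0) + as_right * a
                if as_left:
                    f[(s, t, left)] = f.get((s, t, left), 0.0) + as_left * a
    return f


# A-matrix: (i, j, k) -> a_ijk. The probabilities for S are ASSUMED for illustration.
A = {("S", "C", "D"): 0.6, ("S", "D", "C"): 0.4,
     ("C", "B", "A"): 1.0, ("D", "A", "B"): 1.0}
# B-matrix: (j, terminal) -> b_jk, values as quoted in step 1.
B = {("A", "e"): 0.6, ("A", "a"): 0.4, ("B", "d"): 0.3, ("B", "l"): 0.7}

e = inside("deal", A, B)
f = outside("deal", A, e)
print(e[(1, 4, "S")])   # probability of 'deal' under the assumed grammar
```

A useful sanity check on any implementation is that, at every position t, the products f[t,t,i] e[t,t,i] summed over the labels i reproduce the probability of the whole string.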

3.6 The Re-estimation Process

Before describing the re-estimation process it is useful to define some extra notation so that
the equations simply become a summation of weights:

$$w_{ijk}^{st} = \sum_{r} f[s,t,i]\; e[s,r,j]\; e[r+1,t,k]\; a_{ijk}$$

$$w_{i}^{st} = f[s,t,i]\; e[s,t,i]$$

The weight, $w_{i}^{st}$, is the probability of the string $O_1 \ldots O_T$ given that the interval $O_s \ldots O_t$
is spanned by non-terminal, i. The weight, $w_{ijk}^{st}$, is the probability of the string $O_1 \ldots O_T$
given that the interval $O_s \ldots O_t$ is spanned by non-terminal, i, and that the non-terminal, i,
rewrites as non-terminals, j and k (see Figure 5).

The A-matrix probabilities are updated as follows:

$$a_{ijk} = \frac{\sum_{l=1}^{L} \frac{1}{p_l} \sum_{s,t} w_{ijk}^{st}}{\sum_{l=1}^{L} \frac{1}{p_l} \sum_{s,t} w_{i}^{st}} \qquad (2)$$

where $p_l$ is the probability of the $l$th string according to the current probabilities and it is
represented in the algorithm as $e[1,T_l,S]$ (the l subscript represents the string index).
So the ‘inside’ and ‘outside’ parsing probabilities (i.e. the e and f values) are combined
with the current rewrite probability for each binary branch and this is normalised by dividing
by the sum of all possible binary branches to give the new A-matrix values. The probability
of the string, $p_l$, is used to weight the statistics in favour of the more unlikely strings to
prevent the rule probabilities ‘feeding’ only from the more common strings as the iterative
process continues.

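Frequency counts of the strings can be used quite simply in the re-estimation equations
by multiplying the weight summation. Denoting the frequency of the $l$th string by $freq_l$,
the re-estimation equation becomes:

$$a_{ijk} = \frac{\sum_{l=1}^{L} freq_l\, \frac{1}{p_l} \sum_{s,t} w_{ijk}^{st}}{\sum_{l=1}^{L} freq_l\, \frac{1}{p_l} \sum_{s,t} w_{i}^{st}} \qquad (3)$$

The B-matrix probabilities are updated similarly:

$$b_{jk} = \frac{\sum_{l=1}^{L} \frac{1}{p_l} \sum_{t:\, O_t = k} w_{j}^{tt}}{\sum_{l=1}^{L} \frac{1}{p_l} \sum_{t} w_{j}^{tt}} \qquad (4)$$

so that the numerator only contains those weights for which the entry at position t in the
string is equal to terminal k. The string frequency can be used as in equation (3):

$$b_{jk} = \frac{\sum_{l=1}^{L} freq_l\, \frac{1}{p_l} \sum_{t:\, O_t = k} w_{j}^{tt}}{\sum_{l=1}^{L} freq_l\, \frac{1}{p_l} \sum_{t} w_{j}^{tt}} \qquad (5)$$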

4  The  Computer  Program 

4.1  Introduction 

The Inside-Outside algorithm computer program is written in modular form in VAX PASCAL.
Details of how to run the program and the I-O algorithm demonstration are given in
the Appendix. The following sections describe each of the modules.


4.2  The  Main  Program 

The main program simply reads in all the control parameters (such as the number of
non-terminal labels and the various file names); it then performs the main iteration loop
for the required number of training cycles.


4.3  Initialisation 

This small module simply initialises the weights to zero at the beginning of each iteration.

4.4  Probability  Assignment 

The  A-matrix  and  B-matrix  probability  values  are  initialised  according  to  the  user's  choice 
from  the  following  three  options: 

•  random  numbers  between  0  and  1 

•  uniform  values 
• probabilities taken from a file which represent prior knowledge about the grammar or
from a previous run

All probabilities are then normalised to conform with the conditions for an SCFG. In
the case of the use of prior knowledge about a particular grammar, the probabilities in
the ‘prior knowledge’ file are slightly reduced because all other (i.e. unspecified) matrix
probabilities are initialised as ‘a small number’ to avoid initialising probabilities as zero.
(If probabilities have zero value then they remain at zero due to the multiplicative functions
in the algorithm.)
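The three initialisation options and the subsequent normalisation might look like the sketch below. The function names, the rule-tuple layout and the value used for ‘a small number’ (0.001) are assumptions; the seed 3895 is the one shown in the Appendix control data.

```python
import random


def normalise(rules):
    """Scale probabilities so that rules sharing a left hand side sum to one."""
    totals = {}
    for rule, p in rules.items():
        totals[rule[0]] = totals.get(rule[0], 0.0) + p
    return {rule: p / totals[rule[0]] for rule, p in rules.items()}


def initialise(all_rules, mode="random", prior=None, floor=1e-3, seed=3895):
    """Return normalised, strictly positive probabilities for every rule.

    all_rules: iterable of rule tuples whose first element is the left hand side.
    mode: 'random', 'uniform' or 'prior' (prior maps some rules to probabilities).
    floor: ASSUMED value of the 'small number' given to unspecified rules.
    """
    rng = random.Random(seed)
    if mode == "random":
        raw = {rule: floor + rng.random() for rule in all_rules}
    elif mode == "uniform":
        raw = {rule: 1.0 for rule in all_rules}
    elif mode == "prior":
        raw = {rule: (prior or {}).get(rule, floor) for rule in all_rules}
    else:
        raise ValueError("unknown initialisation mode: " + mode)
    return normalise(raw)


rules = [("S", "C", "D"), ("S", "D", "C"), ("C", "B", "A"), ("C", "A", "B")]
init = initialise(rules, mode="prior",
                  prior={("S", "C", "D"): 1.0, ("C", "B", "A"): 1.0})
```

Here the rule S → C D, specified as 1.0 in the prior, ends up slightly below 1.0 after normalisation, which is the small reduction the text describes.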

4.5 Calculation of e and f values

Before each iteration the e and f values are initialised to zero. The values are then simply
accumulated over the training set. The bottom up process is controlled using the following
loops:

interval length goes from 1 up to string length

s goes from 1 to string length - interval length + 1

The top down process is controlled using the following loops:

interval length goes from string length down to 1

t goes from string length to interval length

The variable r moves between s and t and the i, j and k nodes assume the values of all
possible labels up to the limits specified for the non- and pre-terminals.
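The two loop orders can be written out as a pair of generators. This is only a sketch of the control structure described above (the generator names are hypothetical); intervals are 1-based and inclusive.

```python
def bottom_up_intervals(T):
    """Yield (s, t), shortest intervals first: the order used for the e values."""
    for length in range(1, T + 1):
        for s in range(1, T - length + 2):
            yield s, s + length - 1


def top_down_intervals(T):
    """Yield (s, t), longest intervals first: the order used for the f values."""
    for length in range(T, 0, -1):
        for t in range(T, length - 1, -1):
            yield t - length + 1, t


print(list(bottom_up_intervals(3)))
print(list(top_down_intervals(3)))
```

Both orders visit each of the T(T+1)/2 intervals once; what matters is that the e values of shorter intervals, and the f values of longer intervals, are available before they are needed.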

4.6  Matrix  Update 

At the end of each iteration the e and f values are used to update the A-matrix and
B-matrix probabilities as shown in equations (3) and (5). After the required number of
iterations, the values of the A-matrix and B-matrix and their corresponding indices are
output as the inferred grammar rules.

4.7  Notes 

Pre-terminals are given arbitrary labels in the I-O algorithm but they must be treated quite
distinctly from other non-terminal symbols. Non-terminal symbols are also given arbitrary
names but a special non-terminal symbol must be designated to be the root (start) symbol,
S. Binary CNF rules (e.g. S → AB) are assumed throughout.

5  The  Experiments 

This section describes some of the exploratory work which was carried out to validate and
verify the computer programs and to examine the general behaviour of the algorithm.
5.1 The Input Data

In order to provide training data for the algorithm, the computer readable LOB
(Lancaster-Oslo-Bergen) corpus of English words was used, and all words marked in the corpus as proper
nouns and foreign words were excluded. Hyphenated words and those containing apostrophes
were included and the hyphen and apostrophe are treated as alphabetic characters. Several
small subsets were selected from the corpus for the purposes of testing and training. In the
example described in 5.2, a typical training set is a subset of words which all contain the
substring ‘ght’:


 1 brighter     2 bright       3 bought       3 caught
 6 delight      2 eight        4 flight       3 height
 1 insight      6 fight        1 lights      13 brought
 3 caught       5 daughter     2 fighter      2 flights
 3 height       5 light        1 nights       1 ought
 3 rights       1 sights       1 sought      14 thought
 3 tonight      1 upright      4 weight      31 night
17 right        2 sight        3 tonight      1 upright
 4 weight      31 night        3 rights       2 slight
 2 straight     1 taught       1 upright      4 weight

where  the  number  before  each  word  is  a  frequency  count  of  the  number  of  times  that  the 
word  occurs  in  the  sample  taken  from  the  corpus. 

5.2  A  Grammar  for  spelling  -ght-  words 

The algorithm starts out with random rewrite rule probabilities which are correctly
normalised. The algorithm also needs to be told how many non-terminal and pre-terminal
symbols it is to use for its SCFG. The e and f values are calculated and accumulated for
every word in the training set and at the end of each training set the A-matrix and B-matrix
are updated.

The next iteration uses the new A-matrix and B-matrix probabilities to calculate the
e and f values. In this way the algorithm re-estimates the probabilities according to the
structure in the words of the training set.

5.3 Results

The algorithm uses input data similar to the lists in 5.1 and the number of labels for
non-terminal and pre-terminal symbols is varied in order to examine the effects of changing
these numbers on the resulting grammar. The number of matrix updates is also varied to
discover the rate of change of the grammar as the algorithm is exposed to more training
data. The results in Table 1 were obtained and are listed in ascending order of probability
value.

Results on -ght- words

updates   non, pre-terminals   Aver. word prob.
20        20, 8                0.0001020
20        24, 12               0.0000802
30        20, 8                0.0069200
30        24, 10               0.0133443
30        24, 12               0.0213200
30        22, 10               0.0216900

Table 1: Results of grammatical inference for -ght- words

The first column shows the number of matrix updates (i.e. the number of training
set iterations) before the grammatical inference process is stopped. The second column
gives the upper limit on the number of non-terminal and pre-terminal symbol labels that
the algorithm uses. The third column shows the average word probability taken over all
the words in the training set given the SCFG that the algorithm has inferred in its final
iteration. Figure 6 shows how the inferred grammars gradually get better at explaining
the data in the training sets. The grammar ‘score’ is the logarithm of the average word
probability at the end of each iteration. It takes several iterations for the node labels to
become organised and then the grammar score rapidly increases. The rate of increase slows
down until further iterations produce only small improvements.

The inferred SCFG can be used in a simple random sampling process to generate
synthetic ‘word strings’. The following list of words was generated from the grammar which
resulted from the run with 30 updates, 22 non-terminals and 10 pre-terminals:


night, pnght, light, fright, f enight, aighi, emghl, light, baight, night,
sight, right, eight, light, night, wwnight, night, night, enight, night,
eight, night, eight, rught, sright, night, tnight, night, light, aight,
rught, aight, elught, night, eight, enight, cmght, night, night, unnight,
night, night, uh ght, night, lught, night, night, lught, fecfntght, umight
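The random sampling process used to generate such word strings can be sketched as follows. The toy grammar is the small ‘deal’ grammar of Section 2 rather than the inferred -ght- grammar (whose rules are not listed here), the probabilities for S are assumptions, and the function name is hypothetical.

```python
import random


def sample(symbol, A, B, rng):
    """Expand `symbol` recursively, choosing each rewrite rule by its probability."""
    options = [(rhs, p)
               for (lhs, *rhs), p in list(A.items()) + list(B.items())
               if lhs == symbol]
    right_hand_sides, weights = zip(*options)
    rhs = rng.choices(right_hand_sides, weights=weights)[0]
    if len(rhs) == 1:                 # terminal rule: emit the terminal symbol
        return rhs[0]
    return "".join(sample(child, A, B, rng) for child in rhs)


# ASSUMED probabilities for S; terminal probabilities as quoted in Section 3.5.
A = {("S", "C", "D"): 0.6, ("S", "D", "C"): 0.4,
     ("C", "B", "A"): 1.0, ("D", "A", "B"): 1.0}
B = {("A", "e"): 0.6, ("A", "a"): 0.4, ("B", "d"): 0.3, ("B", "l"): 0.7}

rng = random.Random(3895)
words = [sample("S", A, B, rng) for _ in range(10)]
print(words)
```

Every string produced by this toy grammar has four letters drawn from d, e, a and l; with an inferred grammar the same procedure yields lists like the one above, including non-words that the grammar nevertheless assigns some probability.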

6  A  Maximum  Likelihood  Parser 

6.1  Introduction 

The inferred SCFG can be used to generate the parse of an input word string which gives
a maximum likelihood value. (This is called the maximum likelihood parse.) The e values
are calculated in the normal bottom up process but the maximum e-value is stored for each
node label and interval. Pointers are set to the corresponding j and k (and r) so that in
the top down process the tree can be traversed to generate the maximum likelihood parse.
The process is basically one of dynamic programming; each interval (or span) is a stage and
the non-terminal node labels are the states over which maximisation occurs at each stage.
The recursive procedure calculates the maximum likelihood parse at each stage by taking
into account the maximum likelihood parses at previous stages (i.e. for increasing size of
interval). The final stage is reached at the root node (which spans the whole input string)
where the value of the recursive maximisation function is the maximum likelihood of the
string and the pointers which were set at each stage can be traced back down the tree to
give the parse.

As an example, the maximum likelihood parse of the word ‘night’ is as shown in Figure
7 for the grammar inferred from 30 updates of -ght- training sets with 22 non-terminals and
10 pre-terminals. (The node labels are those used within the program.)

Figure 7: Maximum likelihood parse tree for the word ‘night’
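The dynamic programming procedure can be sketched as below: the sum in the e-value recursion is replaced by a maximum, and a pointer (r, j, k) is kept for each node label and interval. The inferred -ght- grammar is not listed in the text, so the sketch uses the toy ‘deal’ grammar with assumed probabilities for S; the function name and tree encoding are hypothetical.

```python
def best_parse(word, A, B, start="S"):
    """Return (probability, tree) of the most likely parse, or (0.0, None)."""
    T = len(word)
    best = {}                                  # (s, t, i) -> (max prob, pointer)
    for s in range(1, T + 1):
        for (i, term), b in B.items():
            if term == word[s - 1]:
                best[(s, s, i)] = (b, term)
    for length in range(2, T + 1):             # stages: increasing interval size
        for s in range(1, T - length + 2):
            t = s + length - 1
            for (i, j, k), a in A.items():
                for r in range(s, t):
                    if (s, r, j) in best and (r + 1, t, k) in best:
                        p = best[(s, r, j)][0] * best[(r + 1, t, k)][0] * a
                        if p > best.get((s, t, i), (0.0, None))[0]:
                            best[(s, t, i)] = (p, (r, j, k))

    def build(s, t, i):
        """Trace the stored pointers back down the tree."""
        _, pointer = best[(s, t, i)]
        if s == t:
            return (i, pointer)                # pre-terminal and its terminal
        r, j, k = pointer
        return (i, build(s, r, j), build(r + 1, t, k))

    if (1, T, start) not in best:
        return 0.0, None
    return best[(1, T, start)][0], build(1, T, start)


# ASSUMED probabilities for S; terminal probabilities as quoted in Section 3.5.
A = {("S", "C", "D"): 0.6, ("S", "D", "C"): 0.4,
     ("C", "B", "A"): 1.0, ("D", "A", "B"): 1.0}
B = {("A", "e"): 0.6, ("A", "a"): 0.4, ("B", "d"): 0.3, ("B", "l"): 0.7}

probability, tree = best_parse("deal", A, B)
print(probability, tree)
```

Because ‘deal’ has only one parse under this grammar, the maximum likelihood equals the total string probability; for an ambiguous grammar it would in general be smaller.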

6.2  Parsing  using  simple  word  tags 

As a simple extension of the word-spelling experiments, the sequences of single letter word
tags from the LOB corpus are analysed using the I-O algorithm. Sentences of length less
than six words are extracted from the LOB corpus and the first letter of the tag for each
word in the sentence is written into a string which represents the broad part of speech
analysis of the sentence. For example, the sentence ‘The cat sat on the mat.’ would be
represented as ‘ANVIAN’ (i.e. article, noun, verb, preposition, article, noun). Such strings are
used as training data for the algorithm. The inferred grammar rules could then be used to
find the maximum likelihood parse of test sentences such as the one in Figure 8.

7  Discussion 

The  results  of  the  simple  -ght-  word  experiments  show  that  too  many  non-terminal  node 
labels  can  give  lower  grammar  scores  due  to  the  extra  processing  required  to  organise  the 
greater  number  of  symbols. 
Figure 8: Maximum likelihood parse tree for the tag string AJNVR

A serious problem with the I-O algorithm is one of ‘underflow’. In other words, the
e and f value calculations involve multiplications of (often) very small probabilities which
quickly approach the lower limit for a real value in computer terms. So far, the small simple
training sets have not caused any underflow problems but future work (which is now being
carried out) on general English spelling is more likely to run into underflow problems so,
to get around the problem, the e and f calculations and matrix re-estimations make use of
logarithms (and so addition rather than multiplication).
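Working with logarithms turns the products in the e and f recursions into additions, but the recursions also contain sums over r, j and k, which then need a log-domain addition. One common way to do this, shown here as an illustration and not necessarily the method used in the program, is the ‘log-sum-exp’ trick; the function name is hypothetical.

```python
import math


def log_sum_exp(log_values):
    """Return log(sum(exp(v) for v in log_values)) without leaving the log domain."""
    finite = [v for v in log_values if v != -math.inf]
    if not finite:
        return -math.inf                      # log of a sum of zeros
    peak = max(finite)                        # factor out the largest term
    return peak + math.log(sum(math.exp(v - peak) for v in finite))


# Two terms of an e-value sum, each a product of two very small probabilities.
# In ordinary floating point both products underflow to zero.
print(1e-200 * 1e-200)                        # 0.0

log_terms = [math.log(1e-200) + math.log(1e-200),
             math.log(3e-200) + math.log(1e-200)]
log_total = log_sum_exp(log_terms)            # log(1e-400 + 3e-400)
print(log_total)
```

The result is the logarithm of 4e-400, a value that cannot be represented directly as a double-precision number.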

The algorithm is processor intensive (in particular, the e and f calculations) in terms
of computing time for each iteration. The processor time scales with the number of
non-terminals and terminals, the number of iterations and the length of the input strings, so to
infer grammars which have many symbols and which can generate long strings, the A-matrix
and B-matrix would have to be initialised with probabilities which reflect prior knowledge
about the grammar (rather than random numbers), in order to reduce the number of
iterations necessary to achieve a grammar with a high score (see Figure 6). The algorithm is
well suited to parallel processing, however, as the e and f values could be calculated quite
separately for each string in the training set. Alternatively, or in addition, parallel or vector
processors could be used for the e and f calculations by exploiting their structure, which is
basically that of matrix multiplication.
8 Appendix

The I-O algorithm program is in [DODD.IO] and the following modules need to be linked:

• BAKERMAIN

• ASSIGN

• INIT WTS

• EFCALC

• UPDATE MATRICES

• AEDMOD

• AEDLIB (LIBRARY)

The VMS command file to link the above modules is called LINKBAKER.COM and it is
executed by typing @LINKBAKER. BAKERTEST.COM runs a test version of the algorithm
and an example of the control data for the particular test run is given below:

set def [dodd.io]

run bakermain

random (* other options are 'prior knowledge' and 'uniform' *)
baker.out (* name of general output file *)
genfil.inp (* name of file for final inferred grammar rules *)
prifil2.inp (* prior knowledge rule file - if used *)
matfil.out (* file for final a-matrix and b-matrix *)
3895 (* initial random number seed *)
3 (* number of non-terminals *)
2 (* number of pre-terminals *)
1 (* number of training sets *)
N (* selection Y/N for uppercase/lowercase input *)
N (* Y/N for spaces between terminals in input strings *)
Y (* Y/N for AED display *)
aedfil2.inp (* file for AED display *)
expert (* control data for the colour routines *)
x*10,y*510,show
n*0 , rgb*0 , 0 , 100 , j
n*l ,rgb*3*0, j
n*255 , bc
exit (* control instructions for AED display *)
abc.inp (* name of file for training set *)
0.01 (* cut off value for output to genfil *)

If an AED display is not required then the AED filename and the control data for the AED
displays should not be included in the command file.